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Abstract 

We propose a method to improve traditional character-based PPM text compression algorithms. Consider a text file as a 
sequence of alternating words and non-words, the basic idea of our algorithm is to encode non-words and prefixes of words 
using character-based context models and encode suffixes of words using dictionary models. By using dictionary models, the 
algorithm can encode multiple characters as a whole, and thus enhance the compression efficiency. The advantages of the 
proposed algorithm are: 1) it does not require any text preprocessing; 2) it does not need any explicit codeword to identify 
switch between context and dictionary models; 3) it can be applied to any character-based PPM algorithms without incurring 
much additional computational cost. Test results show that significant improvements can be obtained over character-based PPM, 
especially in low order cases. [] 
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I. Introduction 

Prediction by partial matching (PPM) JJ] has set a benchmark for text compression algorithms due to its high compression 
efficiency. In PPM, texts are modeled as Markov processes in which the occurrence of a character only depends on its context, 
i.e., n preceding characters, where n is called the context order and is a parameter of PPM. In each context, probabilities of 
next character is maintained. When a new character comes, its probability is estimated using context models and is encoded 
by an arithmetic coder. During the encoding/decoding process, the context models (probability tables) are updated after a 
character is encoded/decoded. By using such adaptive context models, PPM is able to predict text as well as human do ||3|, 
and achieves higher compression ratio than other compression algorithms I?). 

However, one limitation of PPM is that the prediction is character-based, i.e., characters are encoded one by one 
sequentially, which is not quite efficient. In tSJ, word-based PPM is proposed in which every word is predicted by its 
preceding words, but its performance is even worse than character-based PPM |4|, mainly because the alphabet size of 
English words is so large that very long texts are required to collect sufficient statistical information for word-based models. 
Similarly, Horspool Q introduced an algorithm that alternatively switches between word-based and character-based PPM, but 
it needs to explicitly encode the length of characters when character-based PPM is used, resulting in unnecessary overhead. 
Recently, Skibiiiski 111 extended the alphabet of PPM to long repeated strings by pre-processing the whole texts before 
encoding, which showed performance improvement over traditional PPM. 

In this paper, we propose an enhanced algorithm that combines traditional character-based context models and dictionary 
models. The basic idea is that, for most English words, given the first a few characters (prefix) the rest of the word (suffix) 
can be well predicted. Specifically, in addition to context models for character prediction used in traditional PPM 121 we 
introduce dictionary models that contain words with a common prefix. By doing so, a word can be predicted and encoded 
given its prefix, i.e., the first a few characters. Therefore, different from traditional PPM ||2l and its variations fS), ||9l, lITOll . 
ifTTl . proposed algorithm can achieve variable-length prediction by partial matching (VLPPM). The remainder of the paper 
is organized as follows. Section |ll] describes the dictionary model used for variable-length prediction and the framework of 
proposed scheme. Section |lll] details the encoding and decoding algorithms. Test results and conclusions are presented in 
Section |IV] and |V] respectively. 

II. Variable-Length Prediction 

A. Dictionary Model For Variable-Length Prediction 

First of all, a simple example is given to show how variable-length prediction can be achieved by a dictionary model. 
Suppose we are encoding a sequence of texts like this: "...information...". The first 3 characters "inf" have been encoded 
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Figure 1: Comparison of encoding the word "information" given that the first 3 characters "inf have been encoded, (a) order-3 PPM 
encoder which processes characters one by one. (b) proposed variable-length prediction using a dictionary model that estimates rest 
characters of the word at once. 



and characters to be encoded next are "ormation...". If we use order-3 PPM, characters are encoded one after another given 
its context, i.e. 3 preceding characters, as shown in Fig. [T] (a). On the other hand, if we divide every word into two parts: 
a fixed-length prefix and a variable-length suffix, and suppose we have a dictionary that contains words with prefix "inf" 
as shown in Fig. [T] (b). Instead of predicting characters one by one, we can predict suffixes in this dictionary at one time, 
provided that we know the prefix is "inf". Since the dictionary contains suffixes with different length, the prediction is 
variable-length. 

Next, let's take a look at the advantage of using dictionary model. We still focus on the above example, and estimate 
how many bits are required to encode "ormation". For traditional order-3 PPM, assume 1.5 bits are needed on average for 
encoding one character, then we need 1.5 x 8 = 12 bits to encode "ormation". Notice that 1.5 bpc is not an unreasonable 
assumption because order-3 PPM usually achieves more than 2 bpc for text compression ID, JS]. On the other hand, as we 
can see from Fig. [T|(b), there are 8 possible words that start with "inf in the dictionary and "information" is one of them. 
Therefore, if we assign different indexes to these 8 words, we can easily encode "ormation" with as few as log2 8 = 3 bits, 
achieving 9 bits saving over order-3 PPM. 

In practice, the dictionary model contains not only words but also their counts so that an arithmetic coder can be used 
to encode any word in the dictionary. Specifically, every dictionary D includes three parts: a common prefix P, a list of 
strings Wi that are suffixes of words with prefix P, and corresponding counts d. Given a prefix, we can find the associated 
dictionary model and then perform encoding/decoding. Moreover, different from context models in character-based PPM in 
which contexts from order-n to order- 1 are used, we only maintain dictionary models with fixed-length prefix (fixed context 
order). If the prefix is too short, each dictionary will contain a lot of words which is not good for efficient compression. On 
the other hand, if the prefix is too long, number of characters that can be predicted by dictionary model will be very small. 
In order to take full advantage of dictionary model, we choose the length of prefix as 3. 

B. Combining Dictionary Model and Context Model 

As we can see, the dictionary model works on word basis. By word we mean a sequence of consecutive English letters. 
We can always parse a text file into a sequence of alternating words and non-words. For non-words, they cannot be encoded 
by dictionary model, and even for the prefix of a word we need to encode character by character. Therefore, dictionary 
model is combined with context model used in traditional character-based PPM. At the beginning of encoding/decoding, 
both context model and dictionary model are empty, and texts are encoded on character basis. After a word is encoded, 
corresponding dictionary model is updated. Detailed encoding and decoding algorithms will be introduced in Section |lll] 

Compared with the methods in ||6l and l?], the above framework has two advantages. First, no extra bits are required to 
indicate the switch from context model to dictionary model, because every time after the prefix of a word (3 consecutive 
English letters) is encoded or decoded by context model, encoder or decoder will automatically switch to dictionary model. 
Moreover, since dictionary models are constructed and updated during encoding/decoding, no pre-processing is required to 
build initial dictionaries. 



III. Algorithm Details 

A. Model Switching Using Finite State Machine ( FSM) 

Any compression algorithms using more than one model face the problem of model switching lfT2l . For example, in 
traditional PPM in which up to n + 1 context models (order-n to order-0) might be used when encoding a character, escape 
code is sent as a signal to let decoder know switch from current order to lower order In our case, we need a mechanism to 
control the switch between dictionary model and context model. We use finite state machine (FSM), and for both encoder 
and decoder there are 3 states: 5*0, Si and 82- The transition rules between states for encoding and decoding are different, 
which will be described next. 

B. Encoding Algorithm 

The encoding process starts at ^o, and the state transition rules are as follows: 

• At 5*0: Encode the next character using context model. If it is an English letter, assign it to an empty string P and 
move to Si; otherwise stay at Sq. 

• At 5*1: Encode the next character using context model. If it is an English letter, append it to string P; otherwise, go 
back to So. If the length of P reaches 3, move to S'2. 

• At S'2: Read consecutive English letters, i.e. suffix string of current word, denoted as W. If a dictionary D associated 
with prefix P if found and W exists in D, encode W using dictionary model D. Otherwise, encode characters in W 
one by one using context models (an escape code should be encoded using dictionary model D if W cannot be found 
in D). Move to So- 

The encoding state transition diagram is depicted in Fig. |2] Pseudo code of the algorithm is provided below. On line 6 and 
10, English{c) is a function that checks character c is an English letter or not. Due to limited space, "break" at the end 
of each "case" clause and the "default" clause are omitted. 
Algorithm 1: VLPPM-Encoder 

1 state So; 

2 while c ^ EOF do 
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encode next character c using context model; 
switch state do 
case 5*0 

if English(c) then 

P ^c; 
state 5*1; 

case 5*1 

if English(c) then P ^ P + c; 
else state Sq; 

if length(P) = 3 then state <— S'2; 
case S2 

read suffix W; 

it find dictionary D with prefix P then 

it find W in D then encode W using D; 
else 

encode escape code using D; 
encode W using context model; 
else encode W using context model; 
update dictionary model; 

state So; 



At state S'2, the probability of escape code is calculated by 
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and the probability for suffix Wi is 
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Figure 2: Encoding algorithm using FSM. "1" and "0" denote the condition that next character to be encoded is an English or a non-English 
letter, respectively. 



C. Decoding Algorithm 

Similar to the encoding algorithm, the decoding algorithm also uses FSM with 3 states and starts at Sq, but with different 
state transition rules: 

• At 5*0: Decode the next character using context model. If it is an English letter, assign it to an empty string P and 
move to Si; otherwise stay at Sq. 

> At 5*1: Decode the next character using context model. If it is an English letter, append it to string P; otherwise, go 
back to Sf). If the length of P reaches 3, decode using dictionary model. If a string of characters are decoded, move 
to 5*0; if an escape code is decoded, move to 82- 

• At 5*2: Decode characters one by one using context model until a non-English letter is decoded. Move to So- 

Algorithm 2: VLPPM-Decoder 

1 state <~ So'-, 

2 while c ^ EOF do 
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decode c using context model; 
switch state do 
case 5*0 

if English(c) then 

P ^ c; 
state 5*1; 

case 5*1 

if English(c) then P 
else state <— So'-, 
if length(P) = 3 then 

if find dictionary D with prefix P then 
decode W using Z?; 
if W is escape code then state S2', 
else update dictionary model; 
else state <— iS'2; 

case 5*2 

decode c using context model; 
if !English(c) then state ^ So', 



D. Exclusion 

As we can see from the decoding algorithm, decoder automatically switches from context model to dictionary model 
once 3 consecutive English letters are decoded. By doing so, we don't need to waste any bits to indicate switches between 
dictionary model and context model. However, this leads to another problem: if a prefix in the dictionary is a word of length 
3 (e.g., "let" can be either a word or a prefix of "lettuce"), then every time this word occurs extra bits will be used by the 
dictionary model to encode an escape code, resulting in performance degradation. To resolve this problem, we introduce an 
exclusion mechanism: after a word is encoded/decoded, if it is a prefix of a dictionary, this dictionary is discarded and this 
prefix is put in a "blacklist" for future reference. During encoding/decoding, if a prefix is in "blacklist", encoder/decoder 
skip dictionary model and use context model directly. 



decode by dictionary model 




Figure 3: Decoding algorithm using FSM. "1" and "0" denote the condition that the next decoded character is an English or a non-English 
letter, respectively. 



Table I: Compression Ratio Comparison of PPM and VLPPM 
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PPM 
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PPM 
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bib 


2.66 


2.25 


18.2% 


2.12 


1.99 


6.5% 


bookl 


2.92 


2.57 


13.6% 


2.48 


2.36 


5.1% 


book2 


2.89 


2.34 


23.5% 


2.27 


2.07 


9.7% 


news 


3.26 


2.87 


13.6% 


2.65 


2.52 


5.2% 


paper 1 


2.94 


2.60 


13.1% 


2.50 


2.40 


4.2% 


paper2 


2.89 


2.52 


14.7% 


2.47 


2.36 


4.7% 


progc 


2.91 


2.71 


7.4% 


2.52 


2.47 


2.0% 


progl 


2.40 


2.13 


12.7% 


1.92 


1.86 


3.2% 


progp 


2.29 


2.05 


11.7% 


1.86 


1.81 


2.8% 


trans 


2.38 


2.08 


14.4% 


1.78 


1.69 


5.3% 


alice2 9 . txt 


2.72 


2.41 


12.9% 


2.31 


2.23 


3.6% 


asyoulik . txt 


2.81 


2.58 


8.9% 


2.53 


2.47 


2.4% 


lcetl0.txt 


2.76 


2.15 


28.4% 


2.19 


1.97 


11.2% 


plrabnl2 . txt 


2.83 


2.51 


12.7% 


2.44 


2.33 


4.7% 


cp , html 


2.73 


2.55 


7.1% 


2.38 


2.34 


1.7% 


fields . c 


2.44 


2.33 


4.7% 


2.18 


2.15 


1.4% 


grammar . Isp 


2.72 


2.60 


4.6% 


2.51 


2.47 


1.6% 


xargs . 1 


3.31 


3.17 


4.4% 


3.08 


3.05 


1.0% 


Average 


2.86 


2.46 


16.3% 


2.36 


2.22 


6.3% 



IV. Performance Evaluation 

A. Compression Efficiency 

In order to show the compression efficiency of the proposed algorithm, VLPPM encoder/decoder are implemented and 
tested on a large set of texts. Specifically, we implemented PPMC encoder/decoder as described in |[8|, and further developed 
VLPPM based on PPMC. Text files from two popular data compression corpora, Calgary corpus |]4| and Canterbury corpus 
ifTSl . are chosen as the data to be compressed. Compression ratio of VLPPM is presented in Table U in terms of bits per 
character (bpc), and is compared with traditional PPM (PPMC). As we can see, using proposed VLPPM algorithm leads to 
considerable performance improvements: 16.3% and 6.3% gains are achieved over traditional PPM for order-2 and order-3, 
respectively. Moreover, although VLPPM implemented here is based on PPMC, it is applicable to any other character-based 
predictive compression schemes, such as all the variations of PPM ID, ||9l, ifTOl . ifTTI . 

In traditional PPM, characters are encoded one by one sequentially. As a result, the number of bits required to encode 
a word increases almost linearly with the word length. However, from the information theoretic point of view this is not 
the case because longer word doesn't necessarily contain more information. VLPPM, on the other hand, predicts several 
characters at once, achieving higher compression efficiency. This is illustrated in Fig. HI in which the average number of 
bits required to encode words with different length is depicted for both PPM and VLPPM. As we can see, the number of 
bits increases almost linearly with word length when PPM is used, in accordance with our analysis above. Compared with 
PPM, VLPPM requires similar number of bits when word length is short, but much fewer bits when word length is long, 
and the longer the word is, the more bits can be saved. 

B. Computational Complexity 

Since VLPPM is based on PPM, it is natural to compare its complexity with that of PPM. Table HI] compares the time 
and memory consumption of VLPPM and PPM encoders for order-2 and order-3. In both cases, the results of VLPPM 
are presented in terms of fraction of time and memory compared with PPM. As we can see, the speed of VLPPM is 
comparable to PPM, and it is even faster than PPM at order-2, due to the variable-length prediction ability. Furthermore, 
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Figure 4: Average number of bits per word when IcetlO . txt is compressed by order-2 PPM and VLPPM. 



Tabl e II: Computational Complexity of PPM and VL PPM 





order-2 


order- 3 




PPM VLPPM 


PPM VLPPM 


Time 


100% 98% 


100% 108% 


Memory 


100% 113% 


100% 108% 



although VLPPM uses dictionary models in addition to context models, the memory consumption only increases by a small 
percentage; 13% and 8% for order-2 and order-3, respectively. This is because VLPPM only maintains dictionaries with 
fixed-length prefix, i.e. prefix with length 3, which means only those words longer than 3 will be stored. Moreover, using 
the exclusion mechanism introduced in the Section IIII-DI excludes certain words from being stored into dictionary which 
further reduces memory use. 

V. Conclusion 

We have presented a text compression algorithm using variable-length prediction by partial matching (VLPPM). By 
introducing dictionary model which contains words with common prefix and combing it with context model used in traditional 
character-based PPM, the proposed method can predict one or more characters at once, further improving the compression 
efficiency without increasing computational complexity a lot. Moreover, the proposed method does not require any text 
preprocessing and can be applied to any other character-based predictive compression algorithms without increasing much 
computational complexity. 
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